import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn import linear_model
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
plt.style.use('seaborn')
sns.set(style = 'white')
%matplotlib inline
The first dataset is an export of my ride data from Strava, an online social
network site for cycling and other sports. This data is a log of every ride since the start of 2018
and contains summary data like the distance and average speed. It was exported using
the script stravaget.py which uses the stravalib module to read data. Some details of
the fields exported by that script can be seen in the documentation for stravalib.
The exported data is a CSV file, so it is easy to read; however, the date information in the file is recorded in a different timezone (UTC), so we need to do a bit of conversion. When reading the data I set the index of the data frame to be the datetime of the ride.
strava = pd.read_csv('data/strava_export.csv', index_col='date', parse_dates=True)
# strava.index = strava.index.tz_convert('UTC')
strava.head()
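The timezone handling described above can be sketched in isolation. This is a minimal illustration (the timestamp here is made up, not from the ride data): a naive timestamp is localised to UTC and then converted to Sydney time.

```python
import pandas as pd

# A naive timestamp is first localised (tz_localize), then an aware one
# can be converted between zones (tz_convert). Sydney is UTC+11 in January
# because of daylight saving.
ts = pd.Timestamp('2018-01-01 00:00').tz_localize('UTC')
sydney = ts.tz_convert('Australia/Sydney')
print(sydney)  # 2018-01-01 11:00:00+11:00
```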
The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.
Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.
cheetah = pd.read_csv('data/cheetah.csv', skipinitialspace=True)
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
cheetah.head()
The GoldenCheetah data contains a great many variables (columns) and I won't go into all of them here. Some that are of particular interest for the analysis below are:
Here are definitions of some of the more important fields in the data. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. In many cases fields are duplicated, and the values should then be the same, although there is room for variation since the algorithm used to calculate them could differ in each case.
Some of the GoldenCheetah parameters are defined in their documentation.
Your first task is to combine these two data frames using the join method of Pandas. The goal is to keep only those rows of data
that appear in both data frames so that we have complete data for every row.
We merge the strava data with the cheetah data using their indices to match observations from one dataset to the other. We specify an inner join to capture the observations that are common to both datasets.
cycling = strava.join(cheetah, how = "inner")
print('The dimensions of the Cycling Data is: {}'.format(cycling.shape))
The resulting dataset, which we named the cycling data, has 243 observations (rows) and 372 variables (columns).
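The inner-join semantics used above can be illustrated with a toy example (synthetic frames, not the real ride data): only index values present in both frames survive the join.

```python
import pandas as pd

# Two small frames sharing some datetime index values.
left = pd.DataFrame({'distance': [10.0, 20.0, 30.0]},
                    index = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03']))
right = pd.DataFrame({'NP': [150, 180]},
                     index = pd.to_datetime(['2018-01-02', '2018-01-03']))

# An inner join keeps only the two shared dates, with columns from both frames.
joined = left.join(right, how = 'inner')
print(joined.shape)  # (2, 2)
```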
We remove the rides that have no recorded power reading, as these are commutes and MTB rides.
cycling_complete = cycling[cycling["device_watts"] == True].copy()  # .copy() avoids SettingWithCopyWarning on later assignments
print('The dimensions of Cycling Data after removing commutes and MTB rides is: {}'.format(cycling_complete.shape))
There are 38 such rides; omitting them from our cycling dataset leaves 209 observations. We name this the cycling_complete dataset.
The workout_type variable is currently coded as character and we convert it into categorical data for use in our subsequent analysis. The elevation gain data is encoded with its unit of measurement (meters); we extract the numeric part of elevation_gain and drop the unit.
cycling_complete['workout_type'] = cycling_complete.workout_type.astype('category')
cycling_complete['elevation_gain'] = cycling_complete.elevation_gain.str.replace(r'[^\d\.]', '', regex = True).astype('float')
Since our dataset contains 372 variables and we are interested in only a handful of them, we define a list containing our variables of interest. Let's display the first 5 observations of the data frame with the relevant variables only.
factors = ["distance", "moving_time", "Average Speed", "average_heartrate", "average_watts", "NP", "TSS", "elevation_gain", "workout_type"]
cycling_complete[factors].head()
cycling_complete[factors].describe()
The average distance travelled when cycling is 35.52 kilometers, with half of the rides covering more than 32.10 kilometers and the farthest cycling distance being 108.20 kilometers. On average, the cycling speed is 25.71 km/hr and 50% of the rides have over 25.11 km/hr cycling speed, with the fastest being 38.35 km/hr. The average heart rate per ride is 123.67 bpm with the highest being 156.5 bpm. Note that there are 20 rides in our dataset that do not have an average heart rate reading. The average power per ride is 167.63 watts, the average normalised power is 223.12 watts and the average training stress score is 100.54. The highest elevation climbed on a ride is 1,474 meters and the average elevation climbed per ride is 359.58 meters.
We now look at the distribution of key variables in our dataset: distance, moving time, average speed, average power and training stress score. The observations on the distribution of the variables are discussed after the last histogram plot.
plt_dist = cycling_complete.distance.hist(density = 0, bins = 15, edgecolor = 'black')
plt_dist.set_title('Distribution of Distance', size = 15)
plt_dist.set_xlabel('Distance (in kilometers)', size = 13)
print("Skewness of the distribution of distance: ", cycling_complete.distance.skew())
plt_time = cycling_complete.moving_time.hist(density = 0, bins = 15, edgecolor = 'black')
plt_time.set_title('Distribution of Moving Time', size = 15)
plt_time.set_xlabel('Moving Time (in minutes)', size = 13)
print("Skewness of the distribution of time: ", cycling_complete.moving_time.skew())
plt_speed = cycling_complete['Average Speed'].hist(density = 0, bins = 15, edgecolor = 'black')
plt_speed.set_title('Distribution of Average Speed', size = 15)
plt_speed.set_xlabel('Average Speed (in kilometers per hour)', size = 13)
print("Skewness of the distribution of average speed: ", cycling_complete['Average Speed'].skew())
plt_watts = cycling_complete.average_watts.hist(density = 0, bins = 15, edgecolor = 'black')
plt_watts.set_title('Distribution of Average Watts', size = 15)
plt_watts.set_xlabel('Average Watts (in watts)', size = 13)
print("Skewness of the distribution of average power: ", cycling_complete.average_watts.skew())
plt_tss = cycling_complete.TSS.hist(density = 0, bins = 15, edgecolor = 'black')
plt_tss.set_title('Distribution of Training Stress Score', size = 15)
plt_tss.set_xlabel('Training Stress Score', size = 13)
print("Skewness of the distribution of training stress score: ", cycling_complete.TSS.skew())
Looking at the histograms, the variables don't appear to follow a normal distribution. We also compute the skewness of each distribution and find that all of them are positively skewed.
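To illustrate what a positive skewness value indicates, here is a synthetic check (made-up data, not the ride data): an exponential sample has a long right tail, so its sample skewness comes out positive, just like the variables above.

```python
import numpy as np
import pandas as pd

# Draw a right-skewed (exponential) sample and compute its skewness
# with the same pandas .skew() method used above.
rng = np.random.default_rng(0)
sample = pd.Series(rng.exponential(scale = 30.0, size = 2000))
print(sample.skew() > 0)  # True: long right tail => positive skewness
```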
Let's define a function that displays the Pearson's correlation coefficient for each variable pair in our dataset.
def corrdot(*args, **kwargs):
    corr_r = args[0].corr(args[1], 'pearson')
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 10000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha = 0.6, cmap = "coolwarm",
               vmin = -1, vmax = 1, transform = ax.transAxes)
    font_size = abs(corr_r) * 40 + 5
    ax.annotate(corr_text, [.5, .5], xycoords = "axes fraction",
                ha = 'center', va = 'center', fontsize = font_size)
corr_plot = sns.PairGrid(cycling_complete[factors])
corr_plot.map_diag(plt.hist, bins = 15, edgecolor = "black")
corr_plot.map_lower(plt.scatter, edgecolor = "white")
corr_plot.map_upper(corrdot)
Significant relationships can be observed among our variables of interest. The size of the circle represents the strength of the relationship between the variables; the figure inside is the Pearson correlation coefficient, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation); and the deeper the shade, the stronger the correlation:
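The two extreme values of Pearson's r mentioned above can be checked on toy data (made-up arrays): a perfect positive linear relationship gives r = 1 and a perfect negative one gives r = -1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]   # perfectly linear, positive slope
r_neg = np.corrcoef(x, -3 * x)[0, 1]      # perfectly linear, negative slope
print(round(r_pos, 6), round(r_neg, 6))   # 1.0 -1.0
```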
cycling_complete.workout_type.value_counts().plot.pie(autopct='%1.1f%%')
The cycling dataset is composed of 73.7% rides, 14.1% races and 12.2% workouts. We'll look at a subset of variables and highlight the differences between each workout type.
We'll define a function that will serve as a template for our boxplot to compare each workout type.
def mod_boxplot(x, y, ax0, ax1, xlabel):
    ax = sns.boxplot(x = x, y = y, data = cycling_complete, ax = axs[ax0, ax1])
    ax.set_xlabel(xlabel, size = 25)
    ax.set_xticklabels(ax.get_xmajorticklabels(), size = 25)
    ax.set_ylabel('')
    ax.set_yticklabels(ax.get_ymajorticklabels(), size = 25)
    return ax
fig, axs = plt.subplots(ncols = 2, nrows = 3, figsize=(30,25))
mod_boxplot('distance', 'workout_type', 0, 0, 'Distance')
mod_boxplot('moving_time', 'workout_type', 0, 1, 'Moving Time')
mod_boxplot('Average Speed', 'workout_type', 1, 0, 'Average Speed')
mod_boxplot('average_heartrate', 'workout_type', 1, 1, 'Average Heart Rate')
mod_boxplot('average_watts', 'workout_type', 2, 0, 'Average Power')
mod_boxplot('NP', 'workout_type', 2, 1, 'Normalised Power')
The median distance covered during rides and races is similar, but the spread of values is wider for rides because there are more data points. The median distance for workouts is less than for both races and rides.
Races are finished in a shorter time than rides even though the two have similar median distances, and workouts are also finished in a shorter time than rides. Again, the spread of values for rides is wider than for the other two workout types because more data points are available for rides.
The average speed, average heart rate, average power and normalised power are higher for races as compared to rides and workouts. Rides have the lowest average speed, average heart rate, average power and normalised power among the three workout types.
corr_plot_workout = sns.pairplot(cycling_complete[['distance', 'moving_time', 'Average Speed', 'average_heartrate', 'average_watts', 'NP', 'workout_type']], hue = "workout_type")
The scatterplot matrix confirms the observations made above.
cycling_complete.kudos.describe()
On average, a ride receives 11.53 kudos. The highest number of kudos that a ride got is 24 and the lowest is 2.
plt_kudos = cycling_complete.kudos.hist(density = 0, bins = 15, edgecolor = 'black')
plt_kudos.set_title('Distribution of Kudos', size = 15)
plt_kudos.set_xlabel('Count of Kudos', size = 13)
print("Skewness of the distribution of kudos: ", cycling_complete.kudos.skew())
The kudos distribution does not follow a normal distribution; its skewness is 0.22, which means it is approximately symmetric.
sns.pairplot(data = cycling_complete, y_vars = ['kudos'], x_vars = factors[0:4], kind = 'reg')
sns.pairplot(data = cycling_complete, y_vars = ['kudos'], x_vars = factors[4:8], kind = 'reg')
sns.boxplot(x = 'kudos', y = 'workout_type', data = cycling_complete)
Generally, kudos is positively correlated with the key variables as displayed in the scatterplots above. Races get higher kudos as compared to rides and workouts. Workouts have the lowest median kudos out of the three workout types.
We first get a subset of the cycling data containing the variables of interest. We compute the start of the month to be used as the index of our plot, extract the month-year information from it and convert it to string data. Then, we aggregate by the month-year variable and calculate the total distance, total stress and mean of average speed. We then plot the aggregated data and label the plot as necessary.
plot_cycling = cycling_complete[['distance', 'TSS', 'Average Speed']].copy()
plot_cycling['month'] = (plot_cycling.index - pd.offsets.MonthBegin(1)).date
plot_cycling.head()
table1 = plot_cycling.groupby('month').sum()[['distance', 'TSS']]
table2 = plot_cycling.groupby('month').mean()['Average Speed']
monthly_plot = table1.join(table2, how = "inner")
monthly_plot = monthly_plot.reset_index()
monthly_plot.columns = ['month', 'total_distance', 'total_stress', 'average_speed']
monthly_plot = monthly_plot.sort_values('month')
monthly_plot['month_year'] = monthly_plot.month.apply(lambda x: x.strftime('%B %Y'))
monthly_plot.head()
fig, ax = plt.subplots(figsize = (15, 5))
distance_plot = sns.barplot('month_year', 'total_distance', data = monthly_plot, color = 'red',
ax = ax, alpha = 0.5, label = 'Total Distance')
distance_plot.set(yticks = [], ylabel = 'Total Stress')
distance_plot.set_xticklabels(distance_plot.get_xticklabels(), rotation = 45)
distance_plot.yaxis.labelpad = 37
for i, v in enumerate(monthly_plot.total_distance):
    distance_plot.text(i - 0.2, v + 20,  # place the label just above each bar
                       '%i' % monthly_plot.total_distance[i],
                       fontsize = 12, fontweight = 'bold')
ax2 = ax.twinx()
stress_plot = sns.lineplot('month_year', 'total_stress', data = monthly_plot, color = 'teal',
ax = ax2, marker = 'o', markersize = 10, alpha = 0.5, label = 'Total Stress', sort = False)
stress_plot.set(ylabel = '', ylim = (0, None))
ax3 = ax2.twinx()
speed_plot = sns.lineplot('month_year', 'average_speed', data = monthly_plot, color = 'blue', ax = ax3,
marker = 'o', markersize = 10, alpha = 0.4, label = 'Average Speed', sort = False)
speed_plot.set(ylabel = 'Average Speed', ylim = (0, 35))
ax.legend(loc = (.86, .9), frameon = False)
ax2.legend( loc = (.86, .84), frameon = False)
ax3.legend( loc = (.86, .78), frameon = False)
Total distance and total stress seem to move together month to month. The average speed, on the other hand, doesn't appear proportional to either distance or stress.
A similar approach is taken for the plot by date. This time, we create a function that takes a month as a parameter and displays the plot for that month in the two years that span our dataset.
plot_cycling['date'] = plot_cycling.index.date
plot_cycling.head()
table1 = plot_cycling.groupby('date').sum()[['distance', 'TSS']]
table2 = plot_cycling.groupby('date').mean()['Average Speed']
daily_plot = table1.join(table2, how = "inner")
daily_plot = daily_plot.reset_index()
daily_plot['month'] = daily_plot.date.apply(lambda x: x.strftime('%B'))
daily_plot.columns = ['date', 'total_distance', 'total_stress', 'average_speed', 'month']
daily_plot.date = daily_plot.date.apply(lambda x: x.strftime('%Y-%m-%d'))
daily_plot.head()
def monthly_cycling(m):
    month_subset = daily_plot[daily_plot.month == m].reset_index()
    fig, ax = plt.subplots(figsize = (15, 5))
    distance_d_plot = sns.barplot('date', 'total_distance', data = month_subset, color = 'red',
                                  ax = ax, alpha = 0.5, label = 'Total Distance')
    distance_d_plot.set(yticks = [], ylabel = 'Total Stress')
    distance_d_plot.set_xticklabels(distance_d_plot.get_xticklabels(), rotation = 45)
    distance_d_plot.yaxis.labelpad = 32
    for i, v in enumerate(month_subset.total_distance):
        distance_d_plot.text(i, v,  # place the label at the top of each bar
                             '%i' % month_subset.total_distance[i],
                             fontsize = 12, fontweight = 'bold')
    ax2 = ax.twinx()
    stress_d_plot = sns.lineplot('date', 'total_stress', data = month_subset, color = 'teal',
                                 ax = ax2, marker = 'o', markersize = 10, alpha = 0.5, label = 'Total Stress')
    stress_d_plot.set(ylabel = '', ylim = (0, None))
    ax3 = ax2.twinx()
    speed_d_plot = sns.lineplot('date', 'average_speed', data = month_subset, color = 'blue', ax = ax3,
                                marker = 'o', markersize = 10, alpha = 0.5, label = 'Average Speed')
    speed_d_plot.set(ylabel = 'Average Speed', ylim = (0, None))
    ax.legend(loc = (1.03, .9), frameon = False)
    ax2.legend(loc = (1.03, .84), frameon = False)
    ax3.legend(loc = (1.03, .78), frameon = False)
    return ax, ax2, ax3
monthly_cycling('January')
monthly_cycling('July')
This notebook is a partial replication of the analysis exploring appliance energy consumption based on the research by Luis M. Candanedo, Véronique Feldheim and Dominique Deramaix. The research paper discusses how various variables, such as temperature, humidity and other weather parameters, can predict appliance energy consumption using different machine learning methods. Linear regression, support vector machines, random forests and gradient boosting machines were used for prediction and recursive feature elimination (RFE) was used to rank the predictor variable importance. This notebook will focus on the replication of the multiple linear regression method with RFE to select relevant features.
The dataset was collected from a house in Stambruges designed for low energy consumption. The energy (Wh) data is recorded every 10 minutes to capture quick changes in energy consumption in the house. Lights energy consumption is also collected using a sub-metered load, and temperature and humidity readings are taken in each room of the house. All of this data was collected over 137 days and combined to form the energy dataset. We proceed with exploratory analysis of this data.
energy = pd.read_csv('data/energydata_complete.csv')
energy.date = pd.to_datetime(energy.date)
print('Dimensions of the energy dataset: ', energy.shape)
energy.head()
The dataset contains 19,735 observations and 29 variables (2 of which are random variables and will be removed from our dataset later on). The description of the data can be found below.
Data variables and their description are listed below. We remove variables rv1 and rv2 as these are random variables that were not used in the prediction.
energy = energy.drop(['rv1', 'rv2'], axis = 1)
energy.head()
print('Dimensions of complete energy dataset: ', energy.shape)
We create extra features (number of seconds from midnight, week status and day of the week) to include when fitting the multiple linear regression model. Week status and day of the week are transformed into categorical variables, and we impose an ordering on the day of the week variable, setting Sunday as the start of the week:
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
energy['NSM'] = energy.date.apply(lambda x: x.hour * 3600 + x.minute * 60 + x.second)
energy['week_status'] = energy.date.apply(lambda x: 'weekend' if x.weekday() in (5, 6) else 'weekday').astype('category')  # weekday() 5 and 6 are Saturday and Sunday
energy['day'] = energy.date.dt.day_name().astype('category')
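The weekend test above needs care with operators: in Python, bitwise `|` binds tighter than `==`, so an expression like `x.weekday() == 5 | x.weekday() == 6` is parsed as a chained comparison `x.weekday() == (5 | x.weekday()) == 6` and silently misclassifies Saturdays. A small demonstration on a made-up date:

```python
import datetime as dt

saturday = dt.date(2016, 1, 16)   # a Saturday, so weekday() == 5
w = saturday.weekday()
buggy = (w == 5 | w == 6)         # parsed as w == (5 | w) == 6 -> False
correct = w in (5, 6)             # membership test -> True
print(buggy, correct)
```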
We look at the overall trend of energy consumption of appliances for the entire period in the dataset.
plt.figure(figsize = (20,6))
sns.lineplot(x = 'date', y = 'Appliances', data = energy).set(xlabel = 'Date', ylabel = 'Appliances Wh')
plt.title('Energy Consumption', size = 15)
We can observe multiple spikes in the data which means increased energy consumption during certain dates. We also take a closer look at the first week of data below.
plt.figure(figsize = (20, 6))
sns.lineplot(x = energy.date[0:1008], y = energy.Appliances[0:1008]).set(xlabel = 'Time (week 1)', ylabel = 'Appliances Wh')
plt.title('Energy Consumption for Week 1', size = 15)
plt.figure(figsize=(12, 5))
energy.Appliances.plot.hist(edgecolor = 'black', bins = 40)
energy.Appliances.describe()
plt.figure(figsize=(12, 5))
sns.boxplot(energy.Appliances)
The energy consumption data is highly skewed to the right, as seen in the histogram and boxplot above, with the mean Wh higher than the median Wh. The average energy consumption is 97.69 Wh and 50% of the observations exceed 60 Wh.
We transform our dataset to round down to the nearest hour and aggregate the energy consumption by date and hour. We also create a weekday and week number variable for exploratory data analysis.
energy['mhr'] = energy.date.apply(lambda x: x.replace(second = 0, minute = 0))
energy_total_per_hour = pd.DataFrame(energy.groupby('mhr').sum().Appliances).reset_index()
energy_total_per_hour['day'] = energy_total_per_hour.mhr.dt.day_name().astype(pd.CategoricalDtype(categories = days, ordered = True))
energy_total_per_hour['week_no'] = (energy_total_per_hour.mhr + dt.timedelta(days = 3)).dt.isocalendar().week
energy_total_per_hour.head()
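The `replace(second = 0, minute = 0)` trick above floors each timestamp to the hour; pandas also provides `Timestamp.floor` for the same purpose. A quick check on a made-up timestamp:

```python
import pandas as pd

ts = pd.Timestamp('2016-01-11 17:40:00')
# Flooring to the hour zeroes out minutes and seconds.
print(ts.floor('h'))  # 2016-01-11 17:00:00
```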
def weekly_heatmap(n, idx):
    df = energy_total_per_hour[energy_total_per_hour.week_no == n].copy()
    df['hour'] = df.mhr.apply(lambda x: x.hour)
    fig = sns.heatmap(df.pivot(index = 'hour', columns = 'day', values = 'Appliances'),
                      cmap = 'YlOrRd', ax = axs[idx], vmax = 3600,
                      cbar_kws = {'shrink': 0.75, 'label': 'Appliances', 'orientation': 'horizontal'}).invert_yaxis()
    return fig
A subset (weeks 3 to 6) of the aggregated energy consumption by date and hour is plotted below using a heatmap with the day of the week on the x-axis and the hour of the day (0 being midnight) on the y-axis.
fig, axs = plt.subplots(ncols = 4, figsize=(18,10))
weekly_heatmap(3, 0)
weekly_heatmap(4, 1)
weekly_heatmap(5, 2)
weekly_heatmap(6, 3)
High energy consumption can be observed during the afternoon and night time, as well as on weekends, which might be due to more appliance usage while most family members are at home.
Next, we explore the correlation between the energy consumption of appliances and the rest of the variables. We'll define functions to speed up the creation of scatterplot and correlation matrices.
plot1_cols = ['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3']
plot2_cols = ['Appliances', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6']
plot3_cols = ['Appliances', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9']
plot4_cols = ['Appliances', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'NSM']
def corr_plot(cols):
    df = energy[cols]
    fig = sns.PairGrid(df)
    fig.map_diag(plt.hist, bins = 15, edgecolor = "black")
    fig.map_lower(plt.scatter, edgecolor = "white")
    fig.map_upper(corrdot)
    return fig
corr_plot(plot1_cols)
The correlation and scatterplot matrix above shows a positive correlation between appliance energy consumption and lights, which means that higher appliance energy consumption is associated with higher lights consumption. The second-largest correlation is between Appliances and the temperature in the living room (T2). High positive correlations can be observed among the temperature variable pairs and among the humidity variable pairs. The correlation and scatterplot matrices between energy consumption and the rest of the variables are presented in the succeeding cells.
corr_plot(plot2_cols)
There is a positive correlation between energy consumption and the outside temperature (T6), and a negative correlation between energy consumption and the outside humidity (RH_6). As in the previous plot matrix, the temperature variable pairs and the humidity variable pairs are positively correlated.
corr_plot(plot3_cols)
A positive correlation can be observed between energy consumption and the temperature in the ironing room (T7), the teenager's room (T8) and the parents' room (T9). Again, temperature variable pairs and humidity variable pairs are highly positively correlated.
corr_plot(plot4_cols)
The number of seconds from midnight (NSM) has the highest positive correlation with appliance energy consumption.
The training-test split is 75-25, which means the training dataset contains 14,803 observations and the testing dataset 4,932. We transform the week status and day of the week variables to categorical data and convert them to dummy variables in preparation for fitting the multiple linear regression. Dropping the first dummy level makes weekday and Sunday the reference categories for week status and day of the week, respectively. We also drop the variables rv1 and rv2 from the datasets since they are random variables.
train = pd.read_csv('data/training.csv')
train.date = pd.to_datetime(train.date)
test = pd.read_csv('data/testing.csv')
test.date = pd.to_datetime(test.date)  # parse the test dates, not the training dates
train = train.drop(['rv1', 'rv2'], axis = 1)
test = test.drop(['rv1', 'rv2'], axis = 1)
print("Dimensions of training dataset: ", train.shape)
print("Dimensions of testing dataset: ", test.shape)
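The CSV files used here come pre-split, but an equivalent 75/25 split could be produced with scikit-learn. This is a sketch on a synthetic frame, not the authors' actual split (which follows the original paper):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the energy data; test_size=0.25 gives the 75/25 split.
df = pd.DataFrame({'Appliances': range(100), 'T1': range(100)})
train_df, test_df = train_test_split(df, test_size = 0.25, random_state = 42)
print(train_df.shape, test_df.shape)  # (75, 2) (25, 2)
```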
train['WeekStatus'] = train.WeekStatus.astype('category')
test['WeekStatus'] = test.WeekStatus.astype('category')
train['Day_of_week'] = train.Day_of_week.astype(pd.CategoricalDtype(categories = days, ordered = True))
test['Day_of_week'] = test.Day_of_week.astype(pd.CategoricalDtype(categories = days, ordered = True))
train_WeekStatus = pd.get_dummies(train.WeekStatus, drop_first = True)
test_WeekStatus = pd.get_dummies(test.WeekStatus, drop_first = True)
train_Day_of_week = pd.get_dummies(train.Day_of_week, drop_first = True)
test_Day_of_week = pd.get_dummies(test.Day_of_week, drop_first = True)
train = train.join(train_WeekStatus, how = "inner")
train = train.join(train_Day_of_week, how = "inner")
test = test.join(test_WeekStatus, how = "inner")
test = test.join(test_Day_of_week, how = "inner")
train = train.drop(['WeekStatus', 'Day_of_week'], axis = 1)
train.head()
test = test.drop(['WeekStatus', 'Day_of_week'], axis = 1)
test.head()
Finally, we specify the independent variables and the dependent variable for fitting the model. There are 33 independent variables in total that we will use to predict energy consumption with a multiple linear regression model. We will vary the number of covariates in the model by using recursive feature elimination to retain only the most important variables.
X_cols = list(train.columns.drop(['date', 'Appliances']))
y_col = 'Appliances'
X_train = train[X_cols]
y_train = train[y_col]
X_test = test[X_cols]
y_test = test[y_col]
print('Dimension of X_train: ', X_train.shape)
print('Dimension of y_train: ', y_train.shape)
print('Dimension of X_test: ', X_test.shape)
print('Dimension of y_test: ', y_test.shape)
We develop the multiple linear models with recursive feature elimination in the following section. We start with only the top 5 most important predictors and increase the number of predictors by 5 in each iteration until all predictors are included. For each model built, we will display the linear equation and the most important variables included in the model.
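The mechanics of RFE can be shown on a small synthetic problem (made-up data, not the energy dataset): the response depends only on the first two of five features, and RFE with a linear estimator should retain exactly those two.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# y depends on features 0 and 1 only; the other three are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size = (200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale = 0.1, size = 200)

# RFE repeatedly fits the estimator and drops the feature with the
# smallest absolute coefficient until n_features_to_select remain.
rfe = RFE(estimator = LinearRegression(), n_features_to_select = 2, step = 1).fit(X, y)
print(rfe.support_)  # boolean mask of retained features
```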
Before building the multiple regression models, we define several functions that we will use for the rest of the analysis:
The rfe_mlr function runs the multiple linear regression algorithm with recursive feature elimination. The list_rfe_pred function displays the top n features selected by the RFE algorithm. The model_assess function computes several assessment metrics (root mean square error, R-squared, mean absolute error and mean absolute percentage error) to compare the performance of each regression run, and the print_assess function displays those metrics for each model. The comparison of assessment metrics for all the models built is discussed in the last part of the Portfolio 2 section.
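A tiny worked example (synthetic numbers) of how these assessment metrics are computed:

```python
import numpy as np

y_true = np.array([100.0, 50.0, 200.0])
y_pred = np.array([110.0, 40.0, 190.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))            # 10.0: each error is +/-10
mae = np.mean(np.abs(y_true - y_pred))                     # 10.0: mean of |errors|
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100   # mean of 10%, 20%, 5%
print(rmse, mae, round(mape, 2))  # 10.0 10.0 11.67
```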
def rfe_mlr(n, df_X, df_y):
    model = linear_model.LinearRegression()
    rfe = RFE(estimator = model, n_features_to_select = n, step = 1)
    rfe.fit(df_X, df_y)  # fit on the arguments, not the global X_train/y_train
    return rfe

def list_rfe_pred(model, df_X):
    features_bool = np.array(model.support_)
    features = np.array(df_X.columns)
    return features[features_bool]

def model_assess(model, df_X, df_y):
    y_pred = model.predict(df_X)
    rmse = np.sqrt(((np.array(df_y) - y_pred) ** 2).sum() / len(df_y))
    r2 = r2_score(df_y, y_pred)
    mae = np.mean(abs(df_y - y_pred))
    mape = np.mean(abs((df_y - y_pred) / df_y)) * 100
    return rmse, r2, mae, mape

def print_assess(model):
    print('RMSE: ', model[0])
    print('R-squared: ', model[1])
    print('Mean Absolute Error: ', model[2])
    print('Mean Absolute Percentage Error: ', model[3])
model5 = rfe_mlr(5, X_train, y_train)
print(list_rfe_pred(model5, X_train))
print("y = x *", model5.estimator_.coef_, "+", model5.estimator_.intercept_)
The recursive feature elimination algorithm selected the following top 5 most important variables to include in the model:
model5_train_assess = model_assess(model5, X_train, y_train)
print_assess(model5_train_assess)
model5_test_assess = model_assess(model5, X_test, y_test)
print_assess(model5_test_assess)
model10 = rfe_mlr(10, X_train, y_train)
print(list_rfe_pred(model10, X_train))
print("y = x *", model10.estimator_.coef_, "+", model10.estimator_.intercept_)
The following 5 variables were added as most important (on top of the 5 variables stated above) and included in the model:
model10_train_assess = model_assess(model10, X_train, y_train)
print_assess(model10_train_assess)
model10_test_assess = model_assess(model10, X_test, y_test)
print_assess(model10_test_assess)
model15 = rfe_mlr(15, X_train, y_train)
print(list_rfe_pred(model15, X_train))
print("y = x *", model15.estimator_.coef_, "+", model15.estimator_.intercept_)
The following 5 variables were added as most important (on top of the 10 variables stated above) and included in the model:
model15_train_assess = model_assess(model15, X_train, y_train)
print_assess(model15_train_assess)
model15_test_assess = model_assess(model15, X_test, y_test)
print_assess(model15_test_assess)
model20 = rfe_mlr(20, X_train, y_train)
print(list_rfe_pred(model20, X_train))
print("y = x *", model20.estimator_.coef_, "+", model20.estimator_.intercept_)
The following 5 variables were added as most important (on top of the 15 variables stated above) and included in the model:
model20_train_assess = model_assess(model20, X_train, y_train)
print_assess(model20_train_assess)
model20_test_assess = model_assess(model20, X_test, y_test)
print_assess(model20_test_assess)
model25 = rfe_mlr(25, X_train, y_train)
print(list_rfe_pred(model25, X_train))
print("y = x *", model25.estimator_.coef_, "+", model25.estimator_.intercept_)
The following 5 variables were added as most important (on top of the 20 variables stated above) and included in the model:
model25_train_assess = model_assess(model25, X_train, y_train)
print_assess(model25_train_assess)
model25_test_assess = model_assess(model25, X_test, y_test)
print_assess(model25_test_assess)
model30 = rfe_mlr(30, X_train, y_train)
print(list_rfe_pred(model30, X_train))
print("y = x *", model30.estimator_.coef_, "+", model30.estimator_.intercept_)
The following 5 variables were added as most important (on top of the 25 variables stated above) and included in the model:
model30_train_assess = model_assess(model30, X_train, y_train)
print_assess(model30_train_assess)
model30_test_assess = model_assess(model30, X_test, y_test)
print_assess(model30_test_assess)
model_all = rfe_mlr(35, X_train, y_train)
print(list_rfe_pred(model_all, X_train))
print("y = x *", model_all.estimator_.coef_, "+", model_all.estimator_.intercept_)
The last 3 variables added in the final iteration, which includes all predictor variables, are:
model_all_train_assess = model_assess(model_all, X_train, y_train)
print_assess(model_all_train_assess)
model_all_test_assess = model_assess(model_all, X_test, y_test)
print_assess(model_all_test_assess)
We compare the performance of the models against one another, creating separate tables of assessment metrics for the training (model_comparison_train) and testing (model_comparison_test) datasets.
model_comparison_train = pd.DataFrame(['Root Mean Square Error', 'R-squared', 'Mean Absolute Error', 'Mean Absolute Percentage Error'])
model_comparison_train['5 Variables'] = pd.DataFrame(model5_train_assess)
model_comparison_train['10 Variables'] = pd.DataFrame(model10_train_assess)
model_comparison_train['15 Variables'] = pd.DataFrame(model15_train_assess)
model_comparison_train['20 Variables'] = pd.DataFrame(model20_train_assess)
model_comparison_train['25 Variables'] = pd.DataFrame(model25_train_assess)
model_comparison_train['30 Variables'] = pd.DataFrame(model30_train_assess)
model_comparison_train['All Variables'] = pd.DataFrame(model_all_train_assess)
model_comparison_train.rename(columns={0:'Metric (Training Dataset)'}, inplace = True)
model_comparison_test = pd.DataFrame(['Root Mean Square Error', 'R-squared', 'Mean Absolute Error', 'Mean Absolute Percentage Error'])
model_comparison_test['5 Variables'] = pd.DataFrame(model5_test_assess)
model_comparison_test['10 Variables'] = pd.DataFrame(model10_test_assess)
model_comparison_test['15 Variables'] = pd.DataFrame(model15_test_assess)
model_comparison_test['20 Variables'] = pd.DataFrame(model20_test_assess)
model_comparison_test['25 Variables'] = pd.DataFrame(model25_test_assess)
model_comparison_test['30 Variables'] = pd.DataFrame(model30_test_assess)
model_comparison_test['All Variables'] = pd.DataFrame(model_all_test_assess)
model_comparison_test.rename(columns={0:'Metric (Testing Dataset)'}, inplace = True)
model_comparison_train
model_comparison_test
As we add more predictors to the model, the root mean square error, mean absolute error and mean absolute percentage error tend to decrease, meaning the predicted values get closer to the actual values on average. Conversely, the R-squared values increase as we add more predictors, but they remain too low to conclude that our linear models are good predictors of energy consumption. Although the models improve each time we add variables, the diagnostics show only small marginal gains. The largest improvement was observed when the number of predictors increased from 5 to 10; adding further predictors brought very little improvement. The research paper discusses a number of other machine learning algorithms that predict energy consumption better than multiple linear regression; these are not discussed here as they go beyond the scope of this notebook.
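The pattern used throughout this section (RFE to pick the top-n predictors, fit a linear model, then score it) can be shown end to end on synthetic data. This is a standalone sketch, not the notebook's rfe_mlr/model_assess helpers; the variable names and data here are illustrative only.

```python
# Minimal sketch of the RFE + linear regression workflow used above,
# run on synthetic data with two genuinely informative columns.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # 10 candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the 2 most important predictors, analogous to rfe_mlr(5, ...), rfe_mlr(10, ...), etc.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
chosen = [i for i, kept in enumerate(selector.support_) if kept]
pred = selector.predict(X)
print("selected columns:", chosen)
print("RMSE:", mean_squared_error(y, pred) ** 0.5)
print("R-squared:", r2_score(y, pred))
```

Because only columns 0 and 3 carry signal, RFE eliminates the noise columns and the two-variable model already explains nearly all the variance; on the ride data the signal is spread across many predictors, which is why the gains above flatten out instead.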
K-means clustering is one of the simplest and popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.
We first generate random data around 4 centers.
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])
# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4
data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)
plt.scatter(data[:,0], data[:,1], s=7, c='k')
plt.show()
You need to generate four random centres.
This part of the portfolio should contain at least the following: k is set to 4; centres = np.random.randn(k, c) * std + mean, where std and mean are the standard deviation and mean of the data and c is the number of features in the data; the random seed is set to 6; and the four clusters are coloured green, blue, yellow, and cyan, with edgecolors set to red.
k = 4
c = 2
mean = data.mean()
std = data.std()
np.random.seed(6)
centres = np.random.randn(k, c) * std + mean
centres_df = pd.DataFrame(centres)
centres_df
data_df = pd.DataFrame(data)
data_df.head()
Define the following functions to be used in the K-means algorithm:
def euclidean_distance(a, b):
    # straight-line distance between two points
    return np.sqrt(((a - b)**2).sum())
def find_cluster(a, centroid):
    # index of the cluster centre closest to point a
    n = len(centroid)
    closest = 0
    min_dist = euclidean_distance(a, centroid.iloc[0])
    for i in range(1, n):
        dist = euclidean_distance(a, centroid.iloc[i])
        if dist < min_dist:
            min_dist = dist
            closest = i
    return closest
def recompute_centroid(df):
    # new centres: the mean position of the points assigned to each cluster
    return pd.DataFrame(df.groupby('cluster').mean())
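As a quick sanity check of the helpers, here is a standalone sketch (the functions are re-defined with plain NumPy/pandas so it runs on its own):

```python
import numpy as np
import pandas as pd

def euclidean_distance(a, b):
    # straight-line distance between two points
    return np.sqrt(((a - b) ** 2).sum())

def find_cluster(a, centroid):
    # index of the centroid row closest to point a
    dists = [euclidean_distance(a, centroid.iloc[i]) for i in range(len(centroid))]
    return int(np.argmin(dists))

centres = pd.DataFrame([[0.0, 0.0], [10.0, 10.0]])
print(euclidean_distance(np.array([0, 0]), np.array([3, 4])))  # 3-4-5 triangle -> 5.0
print(find_cluster(np.array([9, 8]), centres))                 # nearest centre is row 1
```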
You need to implement the process of K-means clustering. Implement each iteration as a separate cell: assign each data point to the closest centre, update the cluster centres based on the data, then plot the new clusters.
The K-means clustering algorithm is executed as follows:
Step 1. Initialise the 4 random cluster centres generated above (assigning one of the 4 colours to each cluster).
Step 2. Compute the distance of each data point to each of the 4 cluster centres and assign each data point to its nearest centre, colouring it accordingly.
Step 3. Recalculate the cluster centres as the average of the x- and y-coordinates of the data points assigned to each current centre.
Step 4. Compute the Euclidean distance between the old and new cluster centres. If each centre barely moved (or not at all), stop the algorithm; otherwise, repeat from Step 2.
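The four steps can be condensed into a single loop. This is a standalone sketch on a small two-blob dataset rather than the notebook's data; to keep the sketch robust, the centres are initialised by sampling data points instead of the randn-based initialisation used above, which avoids the empty-cluster edge case.

```python
import numpy as np

rng = np.random.default_rng(6)
# toy data: two well-separated blobs of 50 points each
pts = np.vstack([rng.normal([0, 0], 0.5, (50, 2)),
                 rng.normal([5, 5], 0.5, (50, 2))])
# Step 1: initialise centres (here, two randomly sampled data points)
centres = pts[rng.choice(len(pts), 2, replace=False)]

for iteration in range(100):
    # Step 2: assign each point to its nearest centre
    dists = np.linalg.norm(pts[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 3: recompute centres as the mean of their assigned points
    new_centres = np.array([pts[labels == k].mean(axis=0) for k in range(2)])
    # Step 4: stop once the centres barely move
    if np.linalg.norm(new_centres - centres, axis=1).max() < 1e-6:
        break
    centres = new_centres

print(np.round(centres[centres[:, 0].argsort()], 1))  # centres sorted by x-coordinate
```

The cells below unroll exactly this loop one iteration per cell so that each assignment and update step can be plotted.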
colors = {0 : 'green',
1 : 'blue',
2 : 'yellow',
3 : 'cyan'}
new_cluster_cntr = centres_df
def recompute_and_plot(centroid, recompute = False):
    # assign each point to its nearest centre; optionally recompute the centres first
    data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, centroid), axis = 1)
    if recompute:
        centroid = recompute_centroid(data_df)
    plt.figure(figsize=(10, 7))
    sns.scatterplot(x = data_df[0], y = data_df[1], hue = data_df.cluster, palette = colors, legend = False)
    sns.scatterplot(x = centroid[0], y = centroid[1], hue = centroid.index, palette = colors, s = 75, edgecolor = 'r')
plt.figure(figsize=(10, 7))
sns.scatterplot(x = data_df[0], y = data_df[1], color = 'k', legend = False)
sns.scatterplot(x = new_cluster_cntr[0], y = new_cluster_cntr[1], hue = new_cluster_cntr.index, palette = colors, s = 75, edgecolor = 'r')
plt.annotate('Initial State', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 0 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 1 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, new_cluster_cntr), axis = 1)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 1 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 2 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, new_cluster_cntr), axis = 1)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 2 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 3 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, new_cluster_cntr), axis = 1)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 3 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 4 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, new_cluster_cntr), axis = 1)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 4 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 5 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
data_df['cluster'] = data_df[[0, 1]].apply(lambda x: find_cluster(x, new_cluster_cntr), axis = 1)
recompute_and_plot(new_cluster_cntr, recompute = False)
plt.annotate('Iteration 5 (Compute Nearest Cluster)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
recompute_and_plot(new_cluster_cntr, recompute = True)
plt.annotate('Iteration 6 (Recalculate Cluster Centres)', xy = (0, 1.05), xycoords = 'axes fraction', fontsize = 15)
old_cluster_cntr = new_cluster_cntr
new_cluster_cntr = recompute_centroid(data_df)
(old_cluster_cntr - new_cluster_cntr).apply(lambda x: euclidean_distance(x[0], x[1]), axis = 1)
new_cluster_cntr.reset_index()
Based on the dataset provided, the K-means algorithm settled on the final cluster centres in 5 iterations. It did not proceed to iteration 6 because the cluster centres no longer changed. The final cluster centres are:
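As a cross-check, the same synthetic data can be clustered with scikit-learn's KMeans. This is a standalone sketch (KMeans is not used elsewhere in this notebook, and its centres may differ slightly from the manual iteration above since the initialisation differs), regenerating the data with the same seed:

```python
import numpy as np
from sklearn.cluster import KMeans

# regenerate the same four synthetic clusters as above
np.random.seed(5)
data = np.concatenate([np.random.randn(200, 2) * 1.5 + [1, 2],
                       np.random.randn(200, 2) * 1.0 + [6, 6],
                       np.random.randn(200, 2) * 0.5 + [9, 1],
                       np.random.randn(200, 2) * 0.8 + [-5, -1]])

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(data)
# sort centres by x-coordinate for a stable comparison with the true centres
print(np.round(km.cluster_centers_[km.cluster_centers_[:, 0].argsort()], 1))
```

The recovered centres sit close to the true generating centres (-5,-1), (1,2), (6,6) and (9,1), confirming the manual implementation converged to a sensible solution.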